Coupled Auto-Enrollment and Speaker Identification Platform for Real-Time Applications

Nicolas Shu

Committee Members

    David V. Anderson, Ph.D. (advisor)

    Justin Romberg, Ph.D.

    Matthieu Bloch, Ph.D.

    Larry Heck, Ph.D.

    Mikle South, Ph.D.

Georgia Institute of Technology PhD Dissertation Defense
Machine Learning

Monday, 4 December 2023

You may access this presentation at https://nicolasshu.com/thesis_defense

Outline:

Real-Time Platform

Speaker Identification

Sensor Localization

Intro

System Infrastructure

Cameo appearances are our way to do acknowledgements throughout the presentation


Line of Best Fit

Offline Algorithm: You have all the data from the beginning

Online Algorithm: You receive the data one piece at a time

(You can have all the data but load one piece of data at a time)

Real-Time data
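As a minimal sketch of the offline/online distinction for a line of best fit: the offline version sees every sample at once, while the online version keeps running statistics and updates them one sample at a time. The names `fit_offline` and `OnlineLineFit` are illustrative and are not taken from the presentation.

```python
def fit_offline(xs, ys):
    """Least-squares slope and intercept using the whole dataset at once."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


class OnlineLineFit:
    """Same least-squares fit, updated one (x, y) sample at a time."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.sxx = 0.0  # running sum of squared x deviations
        self.sxy = 0.0  # running sum of x/y co-deviations

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x                # deviation from the OLD mean of x
        self.mean_x += dx / self.n
        self.mean_y += (y - self.mean_y) / self.n
        self.sxx += dx * (x - self.mean_x)  # old deviation times new deviation
        self.sxy += dx * (y - self.mean_y)

    def line(self):
        slope = self.sxy / self.sxx
        return slope, self.mean_y - slope * self.mean_x


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

offline_slope, offline_intercept = fit_offline(xs, ys)

online = OnlineLineFit()
for x, y in zip(xs, ys):      # data arrives one piece at a time
    online.update(x, y)
online_slope, online_intercept = online.line()
```

Both versions arrive at the same line; the online one never needs the full history in memory.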

Intro

Speaker Identification

Intro

Movies, TV Shows, Podcasts

 - I want to be Barbie and Ken for Halloween

 - Oh yeah?

 - I'll be Barbie and you'll be Ken

But what about events in real-time?

- We should get rid of tipping culture. Waiters and waitresses deserve fair livable wages.

Intro

Max Braverman

Autism Spectrum Disorder

Monitoring of At-Risk Populations

Intro

What are some solutions?


Personally Check in

Hire a Caregiver

Surveillance

Intro

Surveillance

https://www.peoplemanagement.co.uk/article/1747153/one-in-seven-workers-say-employer-monitoring-has-increased-during-covid

Intro

Audio is capable of capturing

  • Vocal interactions
  • Dangerous events which manifest audio cues

Different modalities are capable of capturing different information

Intro

Microphones have been widely accepted in homes

Intro

What is the common denominator?


We need a system that

1. Can identify new incoming speakers and re-identify them

2. Can operate in real time as an online algorithm

Intro

There are many things to consider


Where to put sensors?

System Infrastructure

Speaker Identification Engine

Real-Time

User Interface

Intro

Intro

Sensor Localization

Intro

Sensor Localization

Intro

Different Interiors have different topologies

Sensor Localization

Quick chat about topologies...

  • Convex, simply connected
  • Non-convex, simply connected
  • Non-convex, non-simply connected

Sensor Localization

Intro

Sensor Localization

Intro

Sensor Localization

Intro

Non-simply Connected Environment

Sensor Localization

Intro

Instrument of Choice: LiDAR


Problems

 - The data for each room is in local coordinates to the LiDAR (i.e. all centered at 0)

 - Going from room to room, the orientation may change

Need to quickly manipulate the data, but no GUI was found

Sensor Localization

Intro

Sensor Localization

Intro

LiDAR + GUI = Home Map

Sensor Localization

Intro

Example of TSRB's 5th Floor

Sensor Localization

Intro

Sensor Localization

Intro

Sensor Localization

Home Mapping

Intro

Where do we place the sensors?


Sensor Localization

Intro

Maximum Coverage Problem

Voronoi Tessellations

Sensor Localization

Intro

A framework was built...

which can be applied to non-simply connected environments

A non-simply connected environment


Sensor Localization

Intro

After a lot of adjustments, we created an algorithm which would allow for a proper visible tessellation

Next, networked control algorithms to maximize coverage

Sensor Localization

Intro

Lloyd's Algorithm works well in convex spaces

... but it doesn't work well in non-simply connected spaces


Plus, it is dependent on good initial conditions


Sensor Localization

Intro

So, Lloyd's Algorithm was expanded to be more robust and exploratory
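For reference, a minimal sketch of the basic (unexpanded) Lloyd's Algorithm on a convex space, here the unit square approximated by a grid of sample points. It deliberately starts from a poor initial condition (all sensors bunched in one corner). The helper names, the grid resolution, and the number of sensors are assumptions for illustration; this is not the expanded, exploratory variant described in the presentation.

```python
import random


def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def lloyd_step(sensors, samples):
    """One Lloyd iteration: build Voronoi cells, then move sensors to cell centroids."""
    cells = [[] for _ in sensors]
    for p in samples:
        nearest = min(range(len(sensors)), key=lambda i: sq_dist(p, sensors[i]))
        cells[nearest].append(p)
    moved = []
    for sensor, cell in zip(sensors, cells):
        if cell:
            moved.append((sum(p[0] for p in cell) / len(cell),
                          sum(p[1] for p in cell) / len(cell)))
        else:
            moved.append(sensor)  # empty cell: the sensor stays where it is
    return moved


def coverage_cost(sensors, samples):
    """Mean squared distance from each sample point to its nearest sensor."""
    return sum(min(sq_dist(p, s) for s in sensors) for p in samples) / len(samples)


# Convex environment: the unit square, discretized into a 20 x 20 grid.
n = 20
samples = [((i + 0.5) / n, (j + 0.5) / n) for i in range(n) for j in range(n)]

# Poor initial condition: four sensors bunched into one corner.
rng = random.Random(0)
sensors = [(rng.random() * 0.2, rng.random() * 0.2) for _ in range(4)]

initial_cost = coverage_cost(sensors, samples)
for _ in range(30):
    sensors = lloyd_step(sensors, samples)
final_cost = coverage_cost(sensors, samples)
```

In a convex region the straight-line distance used here matches what a sensor can actually reach; in a non-simply connected environment that assumption breaks, which is one reason the basic algorithm struggles there.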

Sensor Localization

Intro

Let's see this in action at Nick's Apartment


Sensor Localization

Intro

Sensor Localization

Home Mapping

Sensor Localization

Intro

1. Tools to Quickly Map Environment

2. Networked Control to Maximize Coverage

Sensor Localization

Home Mapping

Maximum Coverage

Sensor Localization

Intro

Sensor Localization

System Infrastructure

Maximum Coverage

Home Mapping

Sensor Localization

Intro

System Infrastructure

Sensor Localization

Intro

Let's talk about privacy for a moment...


No recording should ever leave the environment

Sensor Localization

Intro

System Infrastructure

How would we create the infrastructure?


Socket Programming!

Sensor Localization

Intro

System Infrastructure

How would we create the infrastructure?

audiosockets: our Python package
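To illustrate the socket-programming idea only, here is a minimal sketch using Python's standard `socket` and `struct` modules. It does not use or reproduce the audiosockets API; the function names and the length-prefixed framing are assumptions made for this example. A connected socket pair stands in for a recorder node and the server.

```python
import socket
import struct


def send_chunk(sock, payload: bytes) -> None:
    """Send one audio chunk, prefixed with its length as a 4-byte big-endian integer."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n: int) -> bytes:
    """Read exactly n bytes; recv() alone may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        part = sock.recv(n - len(buf))
        if not part:
            raise ConnectionError("socket closed before the message was complete")
        buf.extend(part)
    return bytes(buf)


def recv_chunk(sock) -> bytes:
    """Receive one length-prefixed audio chunk."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


# A connected pair stands in for "recorder" and "server" (hypothetical roles).
recorder_end, server_end = socket.socketpair()

samples = [0, 1200, -1200, 32767, -32768]            # 16-bit PCM samples
chunk = struct.pack(f"<{len(samples)}h", *samples)   # little-endian int16

send_chunk(recorder_end, chunk)
received = recv_chunk(server_end)
decoded = list(struct.unpack(f"<{len(received) // 2}h", received))

recorder_end.close()
server_end.close()
```

The length prefix matters because TCP is a byte stream: without it the receiver cannot tell where one audio chunk ends and the next begins.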


...

Raspberry Pi

Server

Sensor Localization

Intro

System Infrastructure

[Diagram: a Recorder and Processor pair feeding the Server, replicated for one, two, and then four Recorder/Processor pairs]

Sensor Localization

Intro

System Infrastructure

How do we solve this?

[Diagram: the Recorders (four shown) connect through audiosockets to a single Processor and Server]

Sensor Localization

Intro

System Infrastructure

Sensor Localization

System Infrastructure

Maximum Coverage

Home Mapping

System Infrastructure

Sensor Localization

Intro

1. System capable of communicating over network

2. Architecture merges different processes, reducing computational resources

3. Scalable

Sensor Localization

Speaker Identification

System Infrastructure

Maximum Coverage

Home Mapping

System Infrastructure

Sensor Localization

Intro

System Infrastructure

Sensor Localization

Intro

Speaker Identification

Speaker Identification

Sensor Localization

Intro

System Infrastructure

I. Detection of New Classes

II. Identification of Speakers

[Flowchart: audio → "Is this a new speaker?" → yes: Enroll / Register new speaker (Speaker = K'); no: Identify speaker, k \in \{1, ..., K\}, k^* = \arg\max_k P(k) (Speaker = k^*)]

Two Parts:

Traditional Classification

Speaker Identification

Sensor Localization

Intro

System Infrastructure

Prob. Graph. Models

Support Vector Machines

Neural Networks

Decision Trees


But this requires a lot of data!


Traditional Classification

Learn how to do a task well

Few-Shot Classification

(Meta-Learning)

Learn how to learn tasks well


Traditional Speaker Identification


Speaker Identification

Sensor Localization

Intro

System Infrastructure

Few-Shot Classification for Speaker Identification is good!

Learn how to learn tasks well

Speaker Identification

Sensor Localization

Intro

System Infrastructure

Are there algorithms for speaker identification?

X-Vector Networks

[Diagram: x-vector network. Input frames 1, 2, 3, ..., T feed a Time-Delay Neural Network whose layers look at neighboring frames (t-2 ... t+2, widening to t-3 ... t+3), followed by Stats Pooling and a DNN, spanning Layer 1 through Layer 8, to produce the x-vector]

Speaker Identification

Sensor Localization

Intro

System Infrastructure

... but the original x-vector system relies on traditional classification

 

What about few-shot learning?

Prototypical Networks for Few-Shot Learning

Support Set

Query Set

Used to create prototypes

(i.e. centroids)

Used for training

N_S = 3
N_Q = 2
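A minimal sketch of how one training episode could be drawn with N_S = 3 and N_Q = 2: classes are chosen at random, each class contributes a support set (averaged into a prototype, i.e. centroid) and a query set (used for training). The function names and the toy 2-D embeddings are assumptions for illustration, not the presentation's code.

```python
import random


def sample_episode(data, n_classes, n_support, n_query, rng):
    """data maps class -> list of embeddings. Returns (support, query) dicts."""
    classes = rng.sample(sorted(data), n_classes)      # randomly choose classes
    support, query = {}, {}
    for c in classes:
        picked = rng.sample(data[c], n_support + n_query)
        support[c] = picked[:n_support]                # used to create prototypes
        query[c] = picked[n_support:]                  # used for training
    return support, query


def prototype(vectors):
    """Centroid (element-wise mean) of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Toy data: 4 speakers, 5 two-dimensional embeddings each (made-up numbers).
rng = random.Random(0)
data = {
    f"speaker_{k}": [[k + rng.random(), k + rng.random()] for _ in range(5)]
    for k in range(4)
}

support, query = sample_episode(data, n_classes=3, n_support=3, n_query=2, rng=rng)
prototypes = {c: prototype(vs) for c, vs in support.items()}
```

Support and query samples of a class never overlap within an episode, because both are cut from one draw without replacement.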

Speaker Identification

Sensor Localization

Intro

System Infrastructure

Randomly choose classes:


[Diagram: x-vector network. Input frames 1, 2, 3, ..., T feed a Time-Delay Neural Network whose layers look at neighboring frames (t-2 ... t+2, widening to t-3 ... t+3), followed by Stats Pooling and a DNN, spanning Layer 1 through Layer 8, to produce the x-vector]

X-Vector System as a Prototypical Network

Speaker Identification

Sensor Localization

Intro

System Infrastructure

[Diagram: the x-vector network with the DNN (Layer 7 and Layer 8) detached, leaving Input, Layer 1 through Layer 6 (Time-Delay Neural Network and Stats Pooling), and the x-vector output]


Euclidean Distance

Assumption:

The latent subspace creates features which have Gaussian-like characteristics

p_\theta(y=k| \boldsymbol{x}) = \frac{e^{-\mathrm{dist}(f_\theta(\boldsymbol{x}),\boldsymbol{c}_k)}}{\sum_{k'}e^{-\mathrm{dist}(f_\theta(\boldsymbol{x}),\boldsymbol{c}_{k'})}}
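A minimal sketch of this class probability: a softmax over negative distances from the embedding to each prototype. It assumes the squared Euclidean distance as `dist` and treats the embedding f_\theta(x) as already computed; the function names are illustrative.

```python
import math


def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def class_probabilities(embedding, prototypes):
    """Softmax over negative squared Euclidean distances to each prototype c_k."""
    logits = [-sq_euclidean(embedding, c) for c in prototypes]
    top = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prototypical_loss(embedding, prototypes, true_index):
    """Negative log-probability of the true class for one query embedding."""
    return -math.log(class_probabilities(embedding, prototypes)[true_index])


prototypes = [[0.0, 0.0], [3.0, 4.0]]
probs = class_probabilities([0.0, 0.0], prototypes)
loss = prototypical_loss([0.0, 0.0], prototypes, true_index=0)
```

The closer an embedding is to a prototype, the larger that class probability, which is what the negative sign in the exponent encodes.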


Speaker Identification

Sensor Localization

Intro

System Infrastructure

How would this setup perform under different _____?

Number of Samples in Support/Query Sets

Number of Classes/Speakers

Speaker Identification

Sensor Localization

Intro

System Infrastructure

x-vector dimension

512-dim

128-dim

16-dim

So let's talk about the training procedure!

C_{train}
C_{valid}
C_{test}

Training x-vector system as a prototypical network

Speaker Identification

Sensor Localization

Intro

System Infrastructure

Dataset: VoxCeleb1

C_{train}
C_{valid}

X-Vector

System

x-vectors

prototypical

loss

C_{test}

Training x-vector system as a prototypical network

Speaker Identification

Sensor Localization

Intro

System Infrastructure

C_{train}
C_{valid}

X-Vector

System

x-vectors

C_{test}

Training x-vector system as a prototypical network

Speaker Identification

Sensor Localization

Intro

System Infrastructure

So what did these "clustered" x-vectors look like?


X-Vector System

Speaker Identification

Sensor Localization

Intro

System Infrastructure

Let's visually inspect whether x-vectors clustered


Speaker Identification

Sensor Localization

Intro

System Infrastructure

Test Accuracy on Few-Shot Speaker Identification

Speaker Identification

Sensor Localization

Intro

System Infrastructure

C_{test}
C_{valid}
C_{train}

Speaker Identification

Sensor Localization

Intro

System Infrastructure

C_{test}
C_{valid}
C_{train}

speakers with

< 41 mins

speakers with > 41 mins

80%

10%

10%

U_{train}
S_{train}
S_{valid}
S_{test}
U_{valid}
U_{test}
U_{train}
S_{train}
S_{valid}
S_{test}
U_{valid}
U_{test}

X-Vector System


[Diagram: data split. Speakers with < 41 mins of audio and speakers with > 41 mins of audio; an 80% / 10% / 10% split; seen sets S_{train}, S_{valid}, S_{test}; unseen sets U_{train}, U_{valid}, U_{test}]

Stats for Gaussians

\begin{align*} \mu &\in \mathbb{R}^{K \times F} \\ \Sigma &\in \mathbb{R}^{K \times F \times F} \end{align*}

Mahalanobis Distances

d_M(x | \mu, \Sigma) = \sqrt{(x-\mu)^T \Sigma^{-1} (x-\mu)}
\mathcal{N}(x|\mu, \Sigma) = \left[(2\pi)^D |\Sigma|\right]^{-\frac{1}{2}} e^{-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)}
\mathcal{N}(x|\mu, \Sigma) = \left[(2\pi)^D |\Sigma|\right]^{-\frac{1}{2}} e^{-\frac{1}{2} d_M^2(x|\mu, \Sigma)}
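A small numerical sketch of these relations using NumPy: the squared Mahalanobis distance is computed with a linear solve instead of an explicit inverse, and the Gaussian density is written in terms of it. The 2-D mean and covariance are made-up values for illustration.

```python
import numpy as np


def mahalanobis_sq(x, mu, sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return float(diff @ np.linalg.solve(sigma, diff))


def gaussian_pdf(x, mu, sigma):
    """Multivariate normal density expressed through the Mahalanobis distance."""
    d = len(mu)
    norm = ((2 * np.pi) ** d * np.linalg.det(sigma)) ** -0.5
    return float(norm * np.exp(-0.5 * mahalanobis_sq(x, mu, sigma)))


mu = np.array([1.0, 2.0])
sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])
x = np.array([3.0, 2.0])

d2 = mahalanobis_sq(x, mu, sigma)        # 2 units away along an axis with variance 4
peak = gaussian_pdf(mu, mu, sigma)       # density at the mean, where d_M = 0
ratio = gaussian_pdf(x, mu, sigma) / peak
```

The ratio of the density at x to the density at the mean depends only on the Mahalanobis distance, which is why the distance can stand in for the likelihood when comparing clusters that share a covariance.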


Mahalanobis Distances

\mathcal{N}(x|\mu, \Sigma) = \left[(2\pi)^D |\Sigma|\right]^{-\frac{1}{2}} e^{-\frac{1}{2} d^2_M(x|\mu, \Sigma)} \propto e^{-\frac{1}{2} d_M^2(x|\mu, \Sigma)}
\prod_{i=1}^5 \mathcal{N}\left(x_i | \mu_k, \Sigma_k\right) \propto e^{-\frac{1}{2} \sum_{i=1}^5 d_M^2(x_i|\mu_k, \Sigma_k)}
\sum_{i=1}^5 d_M^2(x_i|\mu_k, \Sigma_k) = -2 \ln \left[\prod_{i=1}^5 \mathcal{N}\left(x_i | \mu_k, \Sigma_k\right) \right] + \text{const}

Unseen

Seen

d_M^2(x_1 | \mu_k, \Sigma_k)
d_M^2(x_2 | \mu_k, \Sigma_k)
d_M^2(x_3 | \mu_k, \Sigma_k)
d_M^2(x_4 | \mu_k, \Sigma_k)
d_M^2(x_5 | \mu_k, \Sigma_k)
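A minimal sketch of the joint distance: the squared Mahalanobis distances of a small group of x-vectors (five here) to a cluster are summed, and the group is flagged as unseen when even the closest cluster exceeds a threshold \gamma. The cluster statistics, the threshold value, and the function names are assumptions for illustration.

```python
import numpy as np


def joint_mahalanobis_sq(xs, mu, sigma):
    """Sum of squared Mahalanobis distances of the rows of xs to one cluster."""
    diffs = np.asarray(xs, dtype=float) - mu
    solved = np.linalg.solve(sigma, diffs.T).T   # Sigma^{-1} (x_i - mu) for every row
    return float(np.sum(diffs * solved))


def is_unseen(xs, mus, sigmas, gamma):
    """True when the joint distance to the closest cluster exceeds gamma."""
    joint = [joint_mahalanobis_sq(xs, m, s) for m, s in zip(mus, sigmas)]
    return min(joint) > gamma


# Two toy clusters with identity covariance (made-up values).
mus = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]
sigmas = [np.eye(2), np.eye(2)]
gamma = 1.0

near_cluster_0 = np.tile([0.1, 0.0], (5, 1))   # five x-vectors close to cluster 0
far_from_both = np.tile([5.0, 5.0], (5, 1))    # five x-vectors between the clusters

joint_near = joint_mahalanobis_sq(near_cluster_0, mus[0], sigmas[0])
seen_flag = is_unseen(near_cluster_0, mus, sigmas, gamma)
unseen_flag = is_unseen(far_from_both, mus, sigmas, gamma)
```

Summing over several x-vectors makes the decision less sensitive to a single noisy embedding than thresholding one distance at a time.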

Speaker Identification

Sensor Localization

Intro

System Infrastructure

Speaker Identification

Sensor Localization

Intro

System Infrastructure

speakers with > 41 mins

80%

10%

Stats for Gaussians

\begin{align*} \mu &\in \mathbb{R}^{K \times F} \\ \Sigma &\in \mathbb{R}^{K \times F \times F} \end{align*}

seen

seen

seen

S_{valid}

10%

S_{test}
S_{train}
U_{valid}
U_{test}
\gamma



[Diagram: data split for detection. Speakers with > 41 mins of audio, split 80% / 10% / 10% into the seen sets S_{train}, S_{valid}, S_{test}; unseen sets U_{valid}, U_{test}. Each evaluated set is compared against the threshold: ? \lessgtr \gamma]

Stats for Gaussians

\begin{align*} \mu &\in \mathbb{R}^{K \times F} \\ \Sigma &\in \mathbb{R}^{K \times F \times F} \end{align*}

Compute F1 scores

Speaker Identification

Sensor Localization

Intro

System Infrastructure

F1 Score on Detection between Seen / Unseen Classes on Test Set

Speaker Identification

Sensor Localization

Intro

System Infrastructure

Sensor Localization

Speaker Identification

Detection of New Classes

System Infrastructure

Maximum Coverage

Home Mapping

System Infrastructure

Sensor Localization

Intro

1. Found that x-vector dimensions can be reduced to 32-dim

2. Created method to detect new classes based on few-shot learning clustering

Sensor Localization

Speaker Identification

Detection of New Classes

System Infrastructure

Maximum Coverage

Home Mapping

Adaptive Few-Shot Speaker ID

System Infrastructure

Sensor Localization

Intro

VoxConverse Dataset

Speaker Identification

Sensor Localization

Intro

System Infrastructure

t-distributed Stochastic Neighbor Embedding (t-SNE)

2D Data

3D Data

What about 4 dimensions? 6 dimensions? 32 dimensions?


Our desired x-vectors have 32 dimensions!

 

We can use t-SNE to check for qualitative indications that the x-vectors have clustered

Speaker Identification

Sensor Localization

Intro

System Infrastructure

edixl

fkvvo

Bare t-SNE Projections of x-vectors

Speaker Identification

Sensor Localization

Intro

System Infrastructure

[Flowchart: x-vectors → "Is this a new speaker?" → yes: Enroll / Register new speaker (Speaker = K'); no: Identify speaker, k \in \{1, ..., K\}, k^* = \arg\max_k P(k) (Speaker = k^*)]

This setup has many caveats!


Prob 1: The system will not know the actual labels as it creates predicted labels


Solution:

Matching with Hungarian Algorithm


Matching Algorithms: Hungarian Algorithm

Cost Matrix

$10  $40  $50
$50  $70  $60
$50  $80  $80
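To make the matching problem concrete, here is a small sketch that solves the one-to-one assignment on a 3 x 3 cost matrix by brute force over all permutations. This is not the Hungarian Algorithm itself; it is an exhaustive search that reaches the same minimum total cost, which is practical only for tiny matrices. The row-major arrangement of the nine costs is an assumption about how the slide's matrix is laid out.

```python
from itertools import permutations

# Assumed row-major reading of the nine costs:
# rows could be predicted labels, columns true labels.
cost = [
    [10, 40, 50],
    [50, 70, 60],
    [50, 80, 80],
]


def best_assignment(cost):
    """Try every one-to-one row -> column assignment and keep the cheapest."""
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda cols: sum(cost[r][c] for r, c in enumerate(cols)),
    )
    total = sum(cost[r][c] for r, c in enumerate(best))
    return list(best), total


assignment, total = best_assignment(cost)   # assignment[r] is the column given to row r
```

Because the assignment is one-to-one, any extra rows or columns in a non-square problem are necessarily left unmatched.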


There will be leftover classes when using the Hungarian Algorithm

Speaker Identification

Sensor Localization

Intro

System Infrastructure

Matching Algorithms: Greedy Algorithm

Prob 1: The system will not know the actual labels as it creates predicted labels

Greedy algorithms will use up every predicted class found!
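A minimal sketch of one way such a greedy matching could work: every predicted class is mapped to the true class it overlaps with most, so no predicted class is left over, and several predicted classes may share one true class. The function name and the toy label sequences are assumptions for illustration.

```python
from collections import Counter


def greedy_match(true_labels, predicted_labels):
    """Map each predicted class to the true class it co-occurs with most often."""
    overlap = {}
    for t, p in zip(true_labels, predicted_labels):
        overlap.setdefault(p, Counter())[t] += 1
    return {p: counts.most_common(1)[0][0] for p, counts in overlap.items()}


# Toy frame-level labels: speaker A was split into predicted classes 1 and 2.
true_labels = ["A", "A", "A", "B", "B", "B"]
predicted_labels = [1, 1, 2, 3, 3, 3]

mapping = greedy_match(true_labels, predicted_labels)
relabeled = [mapping[p] for p in predicted_labels]
```

Here both predicted classes 1 and 2 map to speaker A, which a one-to-one matching would not allow.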

Speaker Identification

Sensor Localization

Intro

System Infrastructure

Prob 2: What if the detector has too many false positives?


[Diagram: speaker timeline split into many short segments, labeled I, L, K, J, A, C, B, D, E, F, G, H]

This is VERY segmented!

[Diagram: the same segmented timeline, matched with the Hungarian Algorithm and then with the Greedy Algorithm]

Speaker Identification

Sensor Localization

Intro

System Infrastructure

Prob 3: What if the detector has too many false negatives?


[Diagram: speaker timeline with long runs of only three labels: A, C, B]

[Diagram: the same timeline, matched with the Hungarian Algorithm & Greedy Algorithm]

Speaker Identification

Sensor Localization

Intro

System Infrastructure

Prob 2: What if the detector has too many false positives?

  • High Segmentation
  • Near Perfect Greedy Matching
  • Low Hungarian Matching

Prob 3: What if the detector has too many false negatives?

  • Low Segmentation
  • Many mismatches in Greedy Matching
  • Many mismatches in Hungarian Matching

Speaker Identification

Sensor Localization

Intro

System Infrastructure

Speaker Identification

Sensor Localization

Intro

System Infrastructure

Results will be displayed in this manner:

[Plot: Speaker vs. Time (s), with true label and predicted label rows]

[Flowchart: x-vectors → xvec queue → compute joint Maha. dists to closest cluster → if > \gamma is true: create new class (\mu, \Sigma) from the xvec queue and classify as new cluster; if false: classify as closest cluster → Class]

Mahalanobis Classifier
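A minimal sketch of this enroll-or-identify loop, under simplifying assumptions: every cluster shares one fixed covariance (a few queued x-vectors are too few to estimate a full covariance per class), the cluster mean is set once at enrollment, and the class name, threshold, and toy 2-D vectors are illustrative rather than the presentation's implementation.

```python
import numpy as np


class OnlineSpeakerTracker:
    """Assigns each queue of x-vectors to the closest cluster or enrolls a new one."""

    def __init__(self, gamma, shared_cov):
        self.gamma = gamma
        self.shared_cov = np.asarray(shared_cov, dtype=float)  # assumed fixed covariance
        self.means = []                                         # one mean per enrolled class

    def _joint_dist(self, xs, mu):
        diffs = xs - mu
        return float(np.sum(diffs * np.linalg.solve(self.shared_cov, diffs.T).T))

    def process(self, xvec_queue):
        xs = np.asarray(xvec_queue, dtype=float)
        if self.means:
            dists = [self._joint_dist(xs, mu) for mu in self.means]
            closest = int(np.argmin(dists))
            if dists[closest] <= self.gamma:
                return closest                 # classify as closest cluster
        self.means.append(xs.mean(axis=0))     # create new class from the queue
        return len(self.means) - 1             # classify as new cluster


tracker = OnlineSpeakerTracker(gamma=5.0, shared_cov=np.eye(2))
first = tracker.process([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])     # nothing enrolled yet
second = tracker.process([[5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])    # far from class 0
third = tracker.process([[0.05, 0.05], [0.05, 0.05], [0.05, 0.05]])
```

With a threshold that is too low the tracker keeps enrolling new classes (high segmentation); with one that is too high it folds new speakers into existing classes.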

Experiment 1: Baseline

Speaker Identification

Sensor Localization

Intro

System Infrastructure

Experiment 1: Baseline

Speaker Identification

Sensor Localization

Intro

System Infrastructure

Hungarian

Greedy

Baseline

Speaker Identification

Sensor Localization

Intro

System Infrastructure

Hungarian

Baseline

Conclusion:

  1. The threshold for detection of new speakers does not generalize across datasets
  2. We needed to learn a little about the transfer function

Speaker Identification

Sensor Localization

Intro

System Infrastructure

Remember how in VoxCeleb1, we had 29 speakers with more than 41 mins of audio?

[Diagram: the 29 speakers with > 41 mins of audio, split 80% / 10% / 10% into the seen sets S_{train}, S_{valid}, S_{test}, alongside the unseen sets U_{valid}, U_{test}]

Stats for Gaussians

\begin{align*} \mu &\in \mathbb{R}^{K \times F} \\ \Sigma &\in \mathbb{R}^{K \times F \times F} \end{align*}


Experiment 2: Using 41min Covariance as Model Cov

\Sigma_1
\Sigma_2
\Sigma_3
\Sigma_4
median(
median(
)
)
=
\Sigma_{41min}

Experiment 2: Using 41min Covariance as Model Cov

Remember how in VoxCeleb1, we had 29 speakers with more than 41mins of audio?

speakers with > 41 mins

29 Speakers

S_{train}

80%

\text{median}(\text{median}(\Sigma_1, \Sigma_2, \Sigma_3, \Sigma_4)) = \Sigma_{41min}
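As a rough illustration of how one model covariance could be pooled from several estimates, the sketch below takes an element-wise median across matrices. The element-wise reading, the helper name `elementwise_median`, and the toy 2x2 matrices are all assumptions; the slide nests two medians whose axes are not spelled out, so this is not necessarily the procedure used in the thesis.

```python
from statistics import median


def elementwise_median(mats):
    """Element-wise median across a list of equally sized square matrices."""
    n = len(mats[0])
    return [[median(m[i][j] for m in mats) for j in range(n)] for i in range(n)]


# Toy covariance estimates (hypothetical values, for illustration only)
sigmas = [
    [[1.0, 0.2], [0.2, 2.0]],
    [[1.5, 0.0], [0.0, 1.0]],
    [[0.5, 0.4], [0.4, 3.0]],
]
sigma_pooled = elementwise_median(sigmas)
print(sigma_pooled)
```

Note that an element-wise median of covariance matrices is not guaranteed to be positive definite in general, so a real system would need to check that before inverting it.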

[Flowchart: Mahalanobis Classifier. Incoming x-vectors enter the xvec queue. Compute the joint Mahalanobis distance to the closest cluster, using the cluster means \mu and the model covariance \Sigma. If the distance > \gamma (true): create a new class and classify as new cluster. Otherwise (false): classify as closest cluster. Output: Class.]

Result comparison between Baseline and Using 41min Covariance

[Figure panels for Using 41min Covariance: Hungarian, Greedy]


Experiment 3: 5s Initial Covariance Adaptation


[Flowchart: Mahalanobis Classifier with Covariance Adaptation. Incoming x-vectors enter the xvec queue. Covariance Adaptation: if t < 5s, collect the x-vectors into the Collection; if t = 5s, train a covariance matrix on the Collection to obtain the trained covariance \Sigma^*. Mahalanobis Classifier: compute the joint Mahalanobis distance to the closest cluster, using \mu and \Sigma; if the distance > \gamma (true), create a new class and classify as new cluster; otherwise (false), classify as closest cluster. Output: Class.]
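The collect-then-train step can be sketched as follows. This is a minimal illustration, assuming each x-vector arrives with a timestamp t in seconds; the class name `CovarianceAdapter`, the helper `sample_covariance`, and the toy 2-D vectors are hypothetical and not taken from the thesis code.

```python
def sample_covariance(vectors):
    """Unbiased sample covariance (1/(N-1) normalisation) of a list of vectors."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    return [[sum((v[i] - mean[i]) * (v[j] - mean[j]) for v in vectors) / (n - 1)
             for j in range(dim)] for i in range(dim)]


class CovarianceAdapter:
    """Collects x-vectors during an initial window, then trains a covariance once."""

    def __init__(self, window_s=5.0):
        self.window_s = window_s
        self.collection = []
        self.sigma_star = None  # trained covariance; None until the window closes

    def push(self, t, xvec):
        if self.sigma_star is None:
            if t < self.window_s:
                self.collection.append(xvec)  # still inside the initial window
            elif len(self.collection) >= 2:
                # window has closed: train the covariance on the Collection
                self.sigma_star = sample_covariance(self.collection)
        return self.sigma_star


adapter = CovarianceAdapter(window_s=5.0)
stream = [(0.0, [1.0, 2.0]), (1.5, [2.0, 4.0]), (3.0, [3.0, 6.0]), (5.0, [9.0, 9.0])]
history = [adapter.push(t, x) for t, x in stream]
print(history[-1])
```

With real x-vectors, the number of vectors gathered in 5 s is usually far smaller than their dimension, so the trained matrix would typically need regularisation before it can be inverted; that step is left out here.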


Result comparison between Baseline and 5s Initial Covariance Adaptation

[Figure panels for 5s Initial Covariance Adaptation: Hungarian, Greedy]


Statistics as Dynamic Models (Algorithmic Statistics)

Sample Mean at time T

\mu_T = \frac{1}{T} \sum_{t=1}^{T} x_t = \mu_{T-1} + \frac{1}{T} (x_T - \mu_{T-1})

Sample Covariance at time T

\Sigma_T = \frac{1}{T-1} \sum_{t=1}^{T} (x_t - \mu_T) (x_t - \mu_T)^\top = \frac{T-2}{T-1} \Sigma_{T-1} + \frac{1}{T} (x_T - \mu_{T-1}) (x_T - \mu_{T-1})^\top
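The two recursions can be checked numerically. The sketch below applies them one sample at a time and compares the result against a two-pass batch computation; the class name `RunningStats`, the helper `batch_covariance`, and the toy data are illustrative assumptions rather than the thesis implementation.

```python
class RunningStats:
    """Running sample mean and 1/(T-1) sample covariance of a vector stream."""

    def __init__(self, dim):
        self.T = 0
        self.mu = [0.0] * dim
        self.sigma = [[0.0] * dim for _ in range(dim)]

    def update(self, x):
        self.T += 1
        T = self.T
        delta = [xi - mi for xi, mi in zip(x, self.mu)]  # x_T - mu_{T-1}
        if T >= 2:  # the sample covariance needs at least two samples
            keep = (T - 2) / (T - 1)
            for i, di in enumerate(delta):
                for j, dj in enumerate(delta):
                    self.sigma[i][j] = keep * self.sigma[i][j] + di * dj / T
        self.mu = [mi + di / T for mi, di in zip(self.mu, delta)]


def batch_covariance(xs):
    """Reference: two-pass sample covariance over the whole list."""
    n, dim = len(xs), len(xs[0])
    mu = [sum(x[d] for x in xs) / n for d in range(dim)]
    return [[sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in xs) / (n - 1)
             for j in range(dim)] for i in range(dim)]


data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [5.0, 1.0]]
stats = RunningStats(dim=2)
for x in data:
    stats.update(x)
reference = batch_covariance(data)
max_gap = max(abs(stats.sigma[i][j] - reference[i][j])
              for i in range(2) for j in range(2))
print(stats.mu, stats.sigma, max_gap)
```

Only the current mean, covariance, and sample count are stored, which is what makes the update usable on a stream where past x-vectors are not retained.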


Experiment 4: Algorithmic Stats


[Flowchart: Mahalanobis Classifier with Algorithmic Statistics. Incoming x-vectors enter the xvec queue. Compute the joint Mahalanobis distance to the closest cluster, using \mu and \Sigma; if the distance > \gamma (true), create a new class and classify as new cluster; otherwise (false), classify as closest cluster. Output: Class k. Update \mu_k, \Sigma_k.]

Result comparison between Baseline and Algorithmic Statistics

[Figure panels for Algorithmic Statistics: Hungarian, Greedy]


Experiment 5: 5s Cov. Adapt + Algorithmic Stats


[Flowchart: Mahalanobis Classifier with Covariance Adaptation and Algorithmic Statistics. Incoming x-vectors enter the xvec queue. Covariance Adaptation: if t < 5s, collect the x-vectors into the Collection; if t = 5s, train a covariance matrix on the Collection to obtain the trained covariance \Sigma^*. Mahalanobis Classifier: compute the joint Mahalanobis distance to the closest cluster, using \mu and \Sigma; if the distance > \gamma (true), create a new class and classify as new cluster; otherwise (false), classify as closest cluster. Output: Class k. Update \mu_k, \Sigma_k.]


Result comparison between Baseline and 5s Cov. Adapt + Algorithmic Stats

[Figure panels for 5s Cov. Adapt + Algorithmic Stats: Hungarian, Greedy]


Experiment 6: 5s Cov. Adapt + Algorithmic Mean

[Flowchart: Mahalanobis Classifier with Covariance Adaptation and Algorithmic Mean. Incoming x-vectors enter the xvec queue. Covariance Adaptation supplies the trained covariance \Sigma^*. Mahalanobis Classifier: compute the joint Mahalanobis distance to the closest cluster, using \mu and \Sigma; if the distance > \gamma (true), create a new class and classify as new cluster; otherwise (false), classify as closest cluster. Output: Class k. Update \mu_k.]

Result comparison between Baseline and 5s Cov. Adapt + Algorithmic Mean

[Figure panels for 5s Cov. Adapt + Algorithmic Mean: Hungarian, Greedy]

Sensor Localization
  • Maximum Coverage
  • Home Mapping

Speaker Identification
  • Detection of New Classes
  • Adaptive Few-Shot Speaker ID

Real-Time Platform

System Infrastructure

The Aware Home


Brian Jones

Mapping the Aware Home


Finding Optimal Locations for Max. Coverage at Aware Home


Intel i7-4770 (June 2013), 4 cores / 8 threads @ 3.40 GHz, 32 GB

Creating the Platform


[Diagram: Recorder (x4) → Speaker → \Delta x → if \Delta x > \gamma: true → Register New Class/Speaker (Speaker = New Speaker); false → Speaker = Closest Cluster → Update speaker distr. → Front-End Dashboard]


[Flowchart: Audio is passed through the x-vector system to produce x-vectors. Mahalanobis Classifier: compute the joint Mahalanobis distance to the closest cluster, using \mu and \Sigma; if the distance > \gamma (true), create a new class and classify as new cluster; otherwise (false), classify as closest cluster. Output: Class k. Covariance Adaptation supplies \Sigma^*; Update \mu_k.]

Detection of Classes Method could be used for other applications
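Because the decision rule only needs feature vectors, a class mean per cluster, and a shared covariance, it can be written without any reference to audio. The sketch below is one possible reading of it, under several assumptions: the "joint" distance is taken to be the average Mahalanobis distance of the queued vectors, a new class is enrolled at the mean of the queue, and the names `solve`, `mahalanobis`, and `classify` are hypothetical.

```python
import math


def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * pivot for a, pivot in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def mahalanobis(x, mu, sigma):
    """sqrt((x - mu)^T sigma^{-1} (x - mu)); sigma must be invertible."""
    d = [xi - mi for xi, mi in zip(x, mu)]
    z = solve(sigma, d)
    return math.sqrt(sum(di * zi for di, zi in zip(d, z)))


def classify(queue, means, sigma, gamma):
    """Return (class index, is_new) for a queue of feature vectors."""
    def joint(mu):  # assumption: average distance over the queue
        return sum(mahalanobis(x, mu, sigma) for x in queue) / len(queue)

    if means:
        k = min(range(len(means)), key=lambda i: joint(means[i]))
        if joint(means[k]) <= gamma:
            return k, False  # close enough: classify as closest cluster
    # Farther than gamma from every known cluster: create a new class
    dim = len(queue[0])
    means.append([sum(x[d] for x in queue) / len(queue) for d in range(dim)])
    return len(means) - 1, True


sigma = [[1.0, 0.0], [0.0, 1.0]]
means = []
first = classify([[0.0, 0.0], [0.2, 0.0]], means, sigma, gamma=3.0)
second = classify([[0.1, 0.1]], means, sigma, gamma=3.0)
third = classify([[10.0, 10.0]], means, sigma, gamma=3.0)
print(first, second, third)
```

The first call enrolls class 0 because no clusters exist yet, the second is assigned to it, and the third is far enough away to open class 1.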

[Flowchart: the same Mahalanobis Classifier pipeline, with the Audio / x-vector system front end replaced by generic Inputs]

Contributions of this Thesis

Sensor Localization
  • Home Mapping
  • Maximum Coverage

Speaker Identification
  • Detection of New Classes
  • Adaptive Few-Shot Speaker ID

Real-Time Platform

System Infrastructure

[Thumbnails: platform diagram (Speaker, \Delta x compared with \gamma, Register New Class/Speaker, Update speaker distr., Front-End Dashboard); seen samples compared with \gamma; Mahalanobis Classifier with Cov Adapt \Sigma^* and update of \mu_k]

Acknowledgements

Yash Kiarashi

Ratan Singh

Nasim Katebi

Giulia Da Poian

Sanmathi Kamath

Erick Perez

Jim Kinney

Pradyumna Suresha

Arjun Nakum

V. S. Krishna Madala

Chaitra Hegde

Ayse Cakmak

Robert Tweedy

Zifan Jiang

Clayton Feustel

Mohammad

Brandon Carroll

Devon Janke

Siuka Wong

Doug Chau

Brandon Lew

Dennis Delgado

Peiqi Yang

Sandy Wu

Hansol Choi

Sasha Keizs

Kate Lau

Angelica Quintana

Joe Small

Mike Mones

Best-Naz Eshaghi

Krishna Sanka

Uros Kuzmanovic

Mike Moxey

Luis Ortiz

Nicole Nowbahar

Jane Gong

Daniella Corporan

José Magalhães

Joel Corporan

Yash-Yee Logan

Sanghoon Lee

Luis Rosa

Nauman Ahad

Eric Qin

Will Sealy

Brett Ringel

Nathan Glaser

Norh Asmare

Harold Nikoue

Chris James Banks

Akash Patel

Moamen Soliman

Bogdan Vlahov

Adi Kambil

Ambuz Vimal

Mouhyemen Khan

To the people in GT...

William Gagstetter

Andy Fan

How was the framework for tessellation built?


Making a Framework for Max Coverage

A few steps needed to be accomplished:

  1. Create a virtual environment
  2. Create agents based on a model
  3. Create a swarm based on agents


Creating Agents


Creating a Swarm


Can you further describe the networked controls implemented?


Lloyd's Algorithm works well for Simply Connected Environments

u_{i,L}
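The slide text does not define u_{i,L}. In the textbook form of Lloyd's algorithm, each agent moves toward the centroid of its Voronoi cell, so the sketch below takes u_{i,L} to be the vector from the agent to that centroid. The grid discretisation, the uniform density, the Euclidean distance, and the function name `lloyd_controls` are all assumptions made for illustration.

```python
def lloyd_controls(agents, cells):
    """Return, per agent, (centroid of its Voronoi cell) - (agent position).

    agents: list of (x, y) positions; cells: list of (x, y) free-cell centres.
    """
    sums = [[0.0, 0.0, 0] for _ in agents]
    for cx, cy in cells:
        # Voronoi assignment: each cell belongs to its nearest agent
        i = min(range(len(agents)),
                key=lambda a: (agents[a][0] - cx) ** 2 + (agents[a][1] - cy) ** 2)
        sums[i][0] += cx
        sums[i][1] += cy
        sums[i][2] += 1
    controls = []
    for (x, y), (sx, sy, n) in zip(agents, sums):
        controls.append((sx / n - x, sy / n - y) if n else (0.0, 0.0))
    return controls


# A 4 x 2 rectangle of unit cells and two agents starting near the left edge
cells = [(x + 0.5, y + 0.5) for x in range(4) for y in range(2)]
agents = [(0.5, 0.5), (1.5, 0.5)]
controls = lloyd_controls(agents, cells)
print(controls)
```

Straight-line distance ignores walls, which is one way to see why this form behaves well in simply connected rooms but can misassign cells when obstacles separate an agent from part of its region.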


... but not so much for Non-Simply Connected Envs


u_{i,L}
u_{i,L}

& Good Initial Conds

Using Lloyd's Alg on EP6


Control: Avoid Obstacle

u_{i,ao} = \max \left[ -\log \left( \frac{\|\tilde{u}_{i,ao}\|}{b} \right), 0 \right] \cdot \frac{\tilde{u}_{i,ao}}{\|\tilde{u}_{i,ao}\|}
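The formula can be evaluated directly. In the sketch below, \tilde{u}_{i,ao} is passed in as a 2-D vector and b as a positive scalar; the slide does not define either, so reading \tilde{u}_{i,ao} as a vector pointing from the nearest obstacle to the agent is an assumption, and the function name `avoid_obstacle` is hypothetical. What follows from the formula itself is that the term vanishes whenever \|\tilde{u}_{i,ao}\| \geq b and grows as the norm shrinks.

```python
import math


def avoid_obstacle(u_tilde, b):
    """max(-log(||u_tilde|| / b), 0) times the unit vector along u_tilde."""
    norm = math.hypot(u_tilde[0], u_tilde[1])
    if norm == 0.0:
        return (0.0, 0.0)  # direction undefined; return no control
    gain = max(-math.log(norm / b), 0.0)
    return (gain * u_tilde[0] / norm, gain * u_tilde[1] / norm)


far = avoid_obstacle((2.0, 0.0), b=1.0)   # norm >= b, so the term is zero
near = avoid_obstacle((0.5, 0.0), b=1.0)  # norm < b, so the term is active
print(far, near)
```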


Control: Distance oneself from other Agents

\begin{aligned} \mathcal{E}_{ij} &= e^{-\alpha \| x_i - x_j \|} \\ \frac{\partial \mathcal{E}_{ij}}{\partial x_i} &= -\alpha e^{-\alpha \| x_i - x_j \|} \frac{x_i - x_j}{\| x_i - x_j \|} = w_{ij} (x_i - x_j) \\ w_{ij} &= -\frac{\alpha e^{-\alpha \| x_i - x_j \|}}{\| x_i - x_j \|} \end{aligned}
u_{i,p} = \sum_{j \in N_i} \underbrace{\frac{-\alpha e^{-\alpha \| x_i - x_j \|}}{\| x_i - x_j \|}}_{w_{ij}} \cdot (x_i - x_j)
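A small numerical sketch of this term is given below. It takes w_{ij} to be the coefficient obtained by differentiating \mathcal{E}_{ij} = e^{-\alpha \| x_i - x_j \|} with respect to x_i, which carries a factor \alpha. The function name `separation_control` and the 2-D positions are assumptions. The slide does not state the sign convention with which the combined controller applies u_{i,p}, so the code only evaluates the sum as written.

```python
import math


def separation_control(x_i, neighbours, alpha):
    """Sum over neighbours of w_ij * (x_i - x_j), w_ij = -alpha*exp(-alpha*d)/d."""
    u = [0.0, 0.0]
    for x_j in neighbours:
        diff = (x_i[0] - x_j[0], x_i[1] - x_j[1])
        d = math.hypot(diff[0], diff[1])
        if d == 0.0:
            continue  # coincident agents: the weight is undefined, skip
        w = -alpha * math.exp(-alpha * d) / d
        u[0] += w * diff[0]
        u[1] += w * diff[1]
    return (u[0], u[1])


close = separation_control((1.0, 0.0), [(0.0, 0.0)], alpha=1.0)
distant = separation_control((3.0, 0.0), [(0.0, 0.0)], alpha=1.0)
print(close, distant)
```

The magnitude of each pairwise contribution is \alpha e^{-\alpha d}, so nearby agents dominate the sum and distant ones contribute almost nothing.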


Control:

u_i = k_L u_{i,L} + k_p u_{i,p} + k_{ao} u_{i,ao}


Control: Increase Coverage via Largest Boundary

Assumption:

Large hallway cross-sections lead to larger rooms than small hallway cross-sections

u_{i,b}


Control:

u_i = k_L u_{i,L} + k_p u_{i,p} + k_{ao} u_{i,ao}


Further Results


Can you describe your audiosockets package in more detail?

Python executes code sequentially by default

  • Synchronous
  • sounddevice package



How can I use it?

1. Server Descriptor (server_info.json)

{
    "PORT": 5050,
    "HEADER": 64,
    "FORMAT": "utf-8",
    "DISCONNECT_MSG": "DISCONNECT",
    "logging_format": "%(asctime)s - %(message)s",
    "logging_level": "info"
}

2. Start up a server

from audiosockets import MailmanSocket

# Pass the server descriptor from step 1
mailman = MailmanSocket("server_info.json")
mailman.start()

3. Start Recorder Client

from audiosockets import RecorderSocket

recorder = RecorderSocket("server_info.json")
recorder.start()

4. Start a Processor

from audiosockets import BaseProcessorSocket
from audiosockets.utils import LogMelSpectrogram

class LogMelSpecProcessor(BaseProcessorSocket):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def process_data(self, data):
        # data carries the sampling rate ("fs") and the audio samples ("data")
        fs = data["fs"]
        audio = data["data"]
        lms = LogMelSpectrogram(fs)(audio)
        print(lms.shape)

processor = LogMelSpecProcessor("VAD", "server_info.json")
processor.start()
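The HEADER, FORMAT, and DISCONNECT_MSG fields of the server descriptor resemble a common length-prefixed framing scheme for TCP sockets, in which every message is preceded by a fixed-width header carrying its byte length. Whether audiosockets frames its messages this way is an assumption; the sketch below only illustrates the general pattern with the standard library, and the helper names `frame` and `read_frame` are hypothetical.

```python
import io
import json

# A subset of the descriptor fields shown in step 1
config = json.loads(
    '{"PORT": 5050, "HEADER": 64, "FORMAT": "utf-8", "DISCONNECT_MSG": "DISCONNECT"}'
)


def frame(message, header=config["HEADER"], fmt=config["FORMAT"]):
    """Prefix the encoded message with its length, padded to `header` bytes."""
    payload = message.encode(fmt)
    length = str(len(payload)).encode(fmt)
    return length.ljust(header, b" ") + payload


def read_frame(stream, header=config["HEADER"], fmt=config["FORMAT"]):
    """Read one header, then exactly as many payload bytes as it announces."""
    length = int(stream.read(header).decode(fmt))
    return stream.read(length).decode(fmt)


# An in-memory byte stream stands in for a connected socket here
wire = io.BytesIO(frame("hello") + frame(config["DISCONNECT_MSG"]))
first_msg = read_frame(wire)
second_msg = read_frame(wire)
print(first_msg, second_msg)
```

With a real socket, `recv` may return fewer bytes than requested, so a production reader would loop until the full header and payload have arrived.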

Can we visualize the distances to validate our expectations?

Let's visualize the distances to validate our expectations

[Figure: distances for S_{test} (seen) and U_{test} (unseen)]

Visually inspecting the Mahalanobis distances from Seen/Unseen Test Data


What happens if we vary the number of speakers enrolled?

[Diagram: data split. Speakers with > 41 mins (29 Speakers); seen sets S_{train} (80%), S_{valid} (10%), S_{test} (10%); unseen sets U_{valid}, U_{test}]

Stats for Gaussians

\begin{align*} \mu &\in \mathbb{R}^{K \times F} \\ \Sigma &\in \mathbb{R}^{K \times F \times F} \end{align*}

EER on Detection between Seen / Unseen Classes
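For reference, an equal error rate for seen/unseen detection can be estimated by sweeping the threshold \gamma over the observed distances and finding where the two error rates meet. The sketch below assumes a sample is flagged as new when its distance exceeds \gamma; the function name `equal_error_rate` and the toy distances are illustrative and are not results from the thesis.

```python
def equal_error_rate(seen, unseen):
    """Sweep gamma over all observed distances; return (eer, gamma)."""
    best = None
    for gamma in sorted(set(seen) | set(unseen)):
        false_new = sum(d > gamma for d in seen) / len(seen)        # seen flagged as new
        missed_new = sum(d <= gamma for d in unseen) / len(unseen)  # unseen accepted
        gap = abs(false_new - missed_new)
        if best is None or gap < best[0]:
            best = (gap, (false_new + missed_new) / 2, gamma)
    return best[1], best[2]


# Hypothetical distances to the closest cluster
seen_dists = [1.0, 1.2, 1.5, 2.5]
unseen_dists = [2.0, 3.0, 3.5, 4.0]
eer, gamma = equal_error_rate(seen_dists, unseen_dists)
print(eer, gamma)
```

On finite data the two error curves rarely cross exactly, so the function reports the midpoint of the two rates at the threshold where they are closest.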

[Illustration: covariances increasing, drawn as ovals growing along different eigenvectors]

Adaptation of Model Covariance

Brighter colors indicate later stages